-
-
- An Information System for Corporate Users: Wide Area Information Servers
-
-
- Brewster Kahle
- Thinking Machines Corporation
- Brewster@think.com
- 245 First Street Cambridge MA 02142
-
- Art Medlar
- Scolex Information Systems
- 8 April 1991
- Version 3, TMC Tech Report TMC199, original in MSword
-
-
- To explore text-based information systems for corporate executives,
- four companies have jointly developed a prototype which gives flexible
- access to full-text documents. The four participating companies are
- Dow Jones & Co., with its premier business information sources;
- Thinking Machines Corporation, with its high-end information retrieval
- engines; Apple Computer, with its user interface expertise; and KPMG
- Peat Marwick, with its information-hungry user base.
-
- One of the primary objectives of the project is to allow a user to retrieve
- personal, corporate, and wide area information through one easy-to-use
- interface. For example, instead of using Lotus Magellan for personal
- information, Verity Topic for corporate data, and Dialog for published
- text, one application can access all three categories of information. The
- user isn't required to become familiar with several entirely different
- systems. In addition, since the interface consolidates data from many
- different sources, the data can be manipulated effortlessly, virtually
- without regard to its origins.
-
- The Wide Area Information Server (WAIS, pronounced "ways") project is an
- experimental venture seeking to determine whether current technologies can
- be used to make profitable end-user full-text information systems. Fifteen
- users have been actively using the system for over three months. They have
- integrated it into their workday routine in much the same way as they have
- previously integrated spreadsheets and word processors. This preliminary
- success has convinced us that a WAIS-like system can be a valuable tool for
- corporate information retrieval. This paper discusses the design and
- implementation of the prototype system.
-
-
- Introduction
-
- Electronic publishing is the distribution of textual
- information over electronic networks. It has been emerging as a
- viable alternative to traditional print publishing as the necessary
- underlying technologies develop. Among the more essential of these
- are:
-
- High Resolution Display Screens
- Reliable, High-Speed Data Communications
- Desktop Publishing Systems
- Inexpensive Data Storage Media
-
- While these technologies have been developed for uses other than
- electronic publishing, they are the necessary precursors for full-text
- retrieval systems.
-
- From the user's point of view, there are several problems to be
- overcome. First, there must be some way of finding and selecting
- databases from a potentially unlimited pool. Second, although these
- databases may be organized in different ways, the user should not need
- to become familiar with the internal configuration of each one.
- Finally, there must be some practical way of organizing responses on
- the user's machine in order to maintain control over what may become a
- vast accumulation of data.
-
- In addition, developers are faced with a number of architectural
- issues. The system must be scalable; that is, it must allow for the
- future growth of both the complexity and number of clients and
- servers. It must be secure; each server's data must be protected from
- corruption, and the privacy of the users must be ensured. Lastly,
- since an unreliable source is useless in a corporate environment,
- access must be thoroughly robust.
-
-
- System Overview
-
- The prototype WAIS system takes advantage of current state-of-the-art
- technology, and presents solutions to all of the above problems. The system
- is composed of three separate parts: Clients, Servers, and the Protocol
- which connects them.
-
- The client is the user interface, the server does the indexing and
- retrieval of documents, and the protocol is used to transmit the
- queries and responses. The client and server are isolated from each
- other through the protocol. Any client which is capable of
- translating a user's request into the standard protocol can be used in
- the system. Likewise, any server capable of answering a request
- encoded in the protocol can be used. In order to promote the
- development of both clients and servers, the protocol specification is
- public, as is its initial implementation.
-
- On the client side, queries are formulated as English-language
- questions. The client application then translates the query into the
- WAIS protocol, and transmits it over a network to a server. The
- server receives the transmission, translates the received packet into
- its own query language, and searches for documents satisfying the
- query. The list of relevant documents is then encoded in the
- protocol, and transmitted back to the client. The client decodes the
- response, and displays the results. The documents can then be
- retrieved from the server.
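-
- The round trip described above can be sketched in a few lines of
- Python. This is only an illustration under stated assumptions: JSON
- stands in for the actual WAIS encoding, and the document collection,
- message fields, and function names are all hypothetical.
-
```python
import json

# Hypothetical in-memory collection held by the server: id -> text.
DOCUMENTS = {
    "doc-1": "Quarterly earnings rose at the accounting firm",
    "doc-2": "New supercomputer installed for text retrieval",
}


def encode_query(question: str) -> bytes:
    """Client side: wrap the English question in a protocol message."""
    message = {"type": "search", "question": question}
    return json.dumps(message).encode("utf-8")


def serve(packet: bytes) -> bytes:
    """Server side: decode the packet, search, and encode the result list."""
    request = json.loads(packet.decode("utf-8"))
    wanted = set(request["question"].lower().split())
    hits = [
        {"id": doc_id, "headline": text}
        for doc_id, text in DOCUMENTS.items()
        if wanted & set(text.lower().split())  # any shared word counts
    ]
    return json.dumps({"type": "result", "documents": hits}).encode("utf-8")


def decode_response(packet: bytes) -> list:
    """Client side: unpack the list of matching documents for display."""
    return json.loads(packet.decode("utf-8"))["documents"]


results = decode_response(serve(encode_query("text retrieval engines")))
for doc in results:
    print(doc["id"], "-", doc["headline"])
```
-
- Because the client and server only ever exchange encoded packets,
- either side could be replaced without the other noticing, which is
- the isolation the protocol is meant to provide.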
-
-
- Digital Researcher
-
- The traditional information research scenario is familiar to anyone
- who has ever visited a reference desk at a public or corporate
- library. The client approaches a librarian with a description of
- needed information. The librarian might ask a few background
- questions, and then draws from appropriate sources to provide an
- initial selection of articles, reports, and references. The client
- then sorts through this selection to find the most pertinent
- documents. With feedback from these trials, the researcher can refine
- the materials and even continue to supply the user with a flow of
- information as it becomes available. Monitoring which articles were
- useful can help keep the researcher on-track.
-
- The WAIS system is an attempt at automating this interaction: the user
- states a question in English, and a set of document descriptions comes
- back from selected sources. The user can examine any of the items, be
- they text, picture, video, sound, or whatever. If the initial
- response is incomplete or somehow insufficient, the user can refine
- the question by stating it differently.
-
- In addition, the user may also mark some of the retrieved documents as
- being "relevant" to the question at hand, and then re-run the search.
- The server recognizes the marked documents, and attempts to find
- others which are similar to them. In the present WAIS system,
- "similar" documents are simply ones which share a large number of
- common words; however, there is potentially no upper limit on the
- intelligence of a server in determining what similarity entails. This
- method of information retrieval is called "relevance feedback." The
- idea has been around for many years [1], and the first commercial system
- utilizing it, DowQuest [2], was voted Database of the Year by Online
- Magazine in January 1989.
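-
- As a rough illustration of the shared-word notion of similarity
- described above, the following Python sketch ranks unmarked documents
- by the number of distinct words they have in common with the marked
- ones. The scoring rule, function names, and sample collection are
- assumptions made for the example, not the prototype's actual code.
-
```python
def words(text: str) -> set:
    """Distinct lower-cased words in a piece of text."""
    return set(text.lower().split())


def similar_documents(marked: list, collection: dict, top: int = 3) -> list:
    """Rank unmarked documents by distinct words shared with marked ones."""
    feedback = set()
    for doc_id in marked:
        feedback |= words(collection[doc_id])

    scored = []
    for doc_id, text in collection.items():
        if doc_id in marked:
            continue
        shared = len(feedback & words(text))
        if shared:
            scored.append((shared, doc_id))

    # Most shared words first; ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top]]


collection = {
    "a": "merger talks between two regional banks",
    "b": "regional banks report merger progress",
    "c": "weather forecast for the weekend",
    "d": "banks raise interest rates",
}
ranked = similar_documents(["a"], collection)
print(ranked)
```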
-
-
- User Interfaces: Asking Questions
-
- Users interact with the WAIS system through the Question interface.
- The interface may appear different on various implementations: for
- example, a character display terminal will have a different look than
- one which is capable of displaying bit-mapped graphics. The key,
- however, is that the user need only become familiar with one interface
- which provides access to all available information sources.
-
- The WAIS system, in this first incarnation, was designed to be used by
- accountants and corporate executives who are relatively untrained in
- search techniques. Consequently, to aid those users who have neither
- the time nor desire to learn a special purpose query language, the
- system uses English language queries augmented with relevance
- feedback. While the system's servers currently do not extract
- semantic information from the English queries, they do their best to
- find and rank articles containing the requested words and phrases.
- Used in conjunction with relevance feedback, this method of searching
- has proven to be more than adequate for the types of searches and
- databases typically encountered.
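-
- A minimal sketch of word-based ranking without any semantic analysis
- might look like the following. It assumes a score equal to the number
- of times the query's words occur in each article; the actual servers
- may weight words and phrases quite differently, and all names here
- are hypothetical.
-
```python
import re
from collections import Counter


def tokens(text: str) -> list:
    """Lower-cased alphanumeric words, with punctuation dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(question: str, articles: dict) -> list:
    """Return (title, score) pairs for matching articles, best first."""
    query_words = set(tokens(question))
    scored = []
    for title, body in articles.items():
        counts = Counter(tokens(body))
        score = sum(counts[word] for word in query_words)
        if score:
            scored.append((title, score))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


articles = {
    "Audit rules": "New audit rules change how audit firms report.",
    "Chip sales": "Chip sales rose sharply.",
    "Tax audit": "A tax audit can take months.",
}
ranking = rank("audit rules", articles)
print(ranking)
```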
-
- The illustrations here are taken from the initial WAIStation program
- produced at Thinking Machines for the Apple Macintosh. Several other
- interfaces are under development at Apple Computer, Dow Jones, and
- elsewhere.
-
-
- Step 1: Sources are dragged with the mouse into the Question Window. A
- question can contain multiple sources. When the question is run, it asks
- for information from each included source.
-
-
- Step 2: When a query is run, headlines of documents satisfying the query
- are displayed.
-
-
- Step 3: With the mouse, the user clicks on any result document to retrieve
- it.
-
- Step 4: To refine the search, any one or more of the result documents can
- be moved to the "Which are similar to:" box. When the search is run again,
- the results will be updated to include documents which are "similar" to the
- ones selected.
-
-
- Contacting Remote Sources of Information
-
- Figure 1: The Source description contains all the necessary information for
- contacting an information server.
-
- From the user's point of view, a server is a source of information. It
- can be located anywhere that one's workstation has access to: on the
- local machine, on a network, or on the other side of a modem. The
- user's workstation keeps track of a variety of information about each
- server. The public information about a server includes how to contact
- it, a description of the contents, and the cost. In addition,
- individual users maintain certain private information about the
- servers they use. Users need to budget the money they are willing to
- spend on information from particular servers, they need to know how
- often and when each server is contacted, and they need to assess the
- relative usefulness of each server. This information helps guide the
- workstation in making cost-effective decisions in contacting servers.
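-
- One way to picture a source description is as a record holding both
- the public and the private information listed above. The sketch below
- is hypothetical: the field names, the flat per-query cost, the 0-to-1
- usefulness rating, and the example addresses are assumptions and do
- not reflect the actual source file format.
-
```python
from dataclasses import dataclass


@dataclass
class Source:
    # Public information published by the server.
    name: str
    address: str             # how to contact it (placeholder string)
    description: str
    cost_per_query: float    # assumed flat charge, in dollars
    # Private information kept by the individual user.
    budget: float = 0.0      # most the user is willing to spend
    spent: float = 0.0
    usefulness: float = 0.5  # the user's own rating, 0.0 to 1.0

    def can_afford_query(self) -> bool:
        return self.spent + self.cost_per_query <= self.budget


def sources_to_contact(sources: list) -> list:
    """Affordable sources only, most useful first."""
    affordable = [s for s in sources if s.can_afford_query()]
    return sorted(affordable, key=lambda s: s.usefulness, reverse=True)


news = Source("news wire", "news.example.com", "Business news",
              cost_per_query=0.5, budget=1.0, spent=0.75, usefulness=0.9)
memos = Source("memos", "localhost", "Internal memos",
               cost_per_query=0.0, usefulness=0.6)
almanac = Source("almanac", "almanac.example.org", "Country facts",
                 cost_per_query=0.25, budget=2.0, usefulness=0.8)

chosen = [s.name for s in sources_to_contact([news, memos, almanac])]
print(chosen)
```
-
- Here the news wire is skipped because one more query would exceed its
- budget, even though the user rates it most useful.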
-
- With most current retrieval systems, complications develop as soon as
- one begins dealing with more than one source of information. The most
- common problem is that of asking a particular question. For example,
- one contacts the first source, asks it for information on some topic,
- contacts the next source, asks it the same question (most likely
- using a different query language, a different style of interface, a
- different system of billing), contacts the next source, and so on.
- One of the primary motivations behind the initial development of the
- WAIS system was to replace all this with a single interface.
-
- With WAIS, the user selects a set of sources to query for information,
- and then formulates a question. When the question is run, the system
- automatically asks all the servers for the required information with
- no further interaction necessary by the user. The documents returned
- are sorted and consolidated in a single place, to be easily
- manipulated by the user. The user has transparent access to a
- multitude of local and remote databases.
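-
- The consolidation step can be sketched as follows. Each source is
- modelled as a plain Python function returning (score, headline) pairs;
- these stand-in functions, and the assumption that scores from
- different servers are directly comparable, are illustrative only.
-
```python
def ask_all(question: str, sources: dict) -> list:
    """Send one question to every source and merge the answers."""
    merged = []
    for name, search in sources.items():
        for score, headline in search(question):
            merged.append((score, name, headline))
    # Highest score first; ties broken by source name, then headline.
    merged.sort(key=lambda row: (-row[0], row[1], row[2]))
    return merged


# Stand-in sources: a local file index and a remote news service.
def personal_files(question):
    return [(0.4, "Memo on client onboarding")]


def news_wire(question):
    return [(0.9, "Bank merger announced"), (0.2, "Markets close flat")]


sources = {"personal files": personal_files, "news wire": news_wire}
combined = ask_all("bank mergers", sources)
for score, name, headline in combined:
    print(f"{score:.1f}  {name}: {headline}")
```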
-
-
- Rerunning Questions - A Personal Newspaper
-
- In addition to providing interactive access to a vast quantity of
- information, the WAIS system can also be used as a rudimentary
- personal newspaper. A virtually unlimited number of queries can be
- saved, and updated at periodic intervals. To do this, the user's
- workstation is directed to contact each server at certain set times.
- When a source of information is contacted, any questions referencing
- that source are updated with new documents. The user can then easily
- browse through the results the next morning.
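-
- A saved question that is re-run on a schedule needs to remember which
- documents it has already delivered. The sketch below shows one
- possible bookkeeping scheme; the class, its fields, and the simulated
- nightly results are assumptions, and scheduling itself is left out.
-
```python
from dataclasses import dataclass, field


@dataclass
class SavedQuestion:
    text: str
    seen: set = field(default_factory=set)   # every document delivered
    new: list = field(default_factory=list)  # documents from last run

    def update(self, search) -> None:
        """Re-run the question, keeping only documents not seen before."""
        self.new = [doc for doc in search(self.text) if doc not in self.seen]
        self.seen.update(self.new)

    @property
    def has_new_documents(self) -> bool:
        return bool(self.new)


# Simulated results returned by a source on three successive nights.
nights = [
    ["doc-1", "doc-2"],
    ["doc-1", "doc-2", "doc-3"],
    ["doc-1", "doc-2", "doc-3"],
]

question = SavedQuestion("bank mergers")
history = []
for results in nights:
    question.update(lambda _text, r=results: r)
    history.append(list(question.new))
    print(question.has_new_documents, question.new)
```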
-
- To make the ideal electronic personal newspaper, a system designer
- would need certain technologies which are not available today. Most
- computer screens are too small to allow efficient browsing of large
- amounts of text. Additionally, current data transmission speeds do
- not allow fast enough scanning if the text is not resident on the
- user's machine.
-
- Despite current limitations, the WAIS system employs a number of
- features which will be found in the personal newspaper of the future:
-
- Clear displays of which questions have new documents.
- Searches performed at night to hide communications delays.
- Documents stored on disk for future reference.
- Tools provided to quickly view stored documents.
-
- With these techniques, we have established a foundation of user
- support and acceptance.
-
-
- Servers
-
- The WAIS system was designed to be used by those who wish to sell
- information, as well as those who want to buy it. It provides a
- straightforward mechanism for indexing large amounts of data, making
- it available, and advertising the availability.
-
- The system is flexible enough to provide for a variety of billing
- methods. A small database maintainer might make the information
- available through a telephone connection. Using a 900 number, the
- billing would be taken care of by the phone company. A slightly more
- sophisticated site might have a password and credit card billing
- system. High volume servers might want to set up flat fee contracts
- with customers. Other methods will certainly emerge as use increases.
- The system was designed to be as adaptable as possible to future
- financial arrangements.
-
- As the dissemination of information becomes easier, questions of
- ownership, copyright, and theft of data must be addressed. These
- issues confront the entire information processing field, and are
- particularly acute here. The WAIS system is designed to keep control
- of the data in the hands of the servers. A server can choose to whom
- and when the data should be given. Documents are distributed with an
- explicit copyright disposition in their internal format. This is not
- to say that theft can not occur, but if a client starts to resell
- another's data, standard copyright laws can be invoked.
-
-
- The Directory of Servers
-
- As the WAIS system develops, sources of information will proliferate,
- making it impossible for any user to keep track of all servers that
- may be available at any one time. To help solve this problem,
- Thinking Machines is maintaining a Directory of Servers in a widely
- accessible location. The Directory of Servers contains
- indexed textual descriptions of all known servers. It is queried just
- like any other source. Instead of text documents, however, it returns
- source structures, specially formatted files which can be plugged into
- a question and used for queries.
-
- For example, suppose you needed information concerning the current
- gross national product of Mali, but had no idea where to find it. You
- might first ask the directory of servers for "information about the
- current economic condition of Mali." The directory would return
- several documents; among them might be a source for the World
- Factbook, an online almanac maintained by the CIA. You would then
- use this document as the source field of a question, and re-run the
- query. This time, the system would contact the almanac, ask for the
- information, and return a document with the data you need.
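-
- The two-step lookup in this example can be sketched as below. The
- directory entries, addresses, stop-word list, and matching rule are
- all hypothetical; the point is only that the directory answers a
- question with source structures that can be placed directly into the
- next question.
-
```python
import re

# Hypothetical directory entries; addresses are placeholders.
DIRECTORY = [
    {"name": "world-factbook", "address": "factbook.example.org",
     "description": "almanac of countries with economic and geographic data"},
    {"name": "patent-abstracts", "address": "patents.example.org",
     "description": "abstracts of issued patents"},
]

STOP_WORDS = {"a", "about", "and", "for", "of", "the", "with"}


def content_words(text: str) -> set:
    """Distinct lower-cased words, ignoring a few common stop words."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS


def find_sources(question: str, directory: list) -> list:
    """Return source structures whose descriptions share query words."""
    wanted = content_words(question)
    scored = []
    for source in directory:
        shared = len(wanted & content_words(source["description"]))
        if shared:
            scored.append((shared, source))
    scored.sort(key=lambda pair: -pair[0])
    return [source for _, source in scored]


found = find_sources(
    "information about the current economic condition of Mali", DIRECTORY
)
# The returned structure becomes the source field of the next question.
next_question = {"text": "gross national product of Mali", "sources": found[:1]}
print(next_question["sources"][0]["name"])
```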
-
- Additionally, the Directory of Servers provides a means for
- information providers to advertise the availability of their data.
- When a new source becomes available, the developers can submit a
- textual description, along with the necessary information for
- contacting the server. This information is added to the directory,
- and becomes available to the public.
-
-
- A Common Protocol for Information Retrieval
-
- One of the most far reaching aspects of this project is the
- development of an open protocol. The four companies have jointly
- specified a standard protocol for information retrieval. Creating a
- market where new servers can be readily established requires an open,
- publicly available protocol. Ideally this protocol would be
- internationally standardized, yet flexible enough to adapt to new
- ideas and technologies, and would function over any electronic network,
- from the highest speed optical connections to phone lines.
-
- The use of an open and versatile protocol fosters hardware
- independence. This not only provides for a much wider base of users,
- it allows the system to seamlessly evolve over time as hardware
- technology progresses. It provides incentive to produce the best
- components possible. For example, the protocol provides for the
- transmission of audio and video as well as text, even though at
- present most workstations are unable to handle them. Such
- workstations are free to ignore pictures and sound returned in
- response to a question, and to display and retrieve only text. This
- inability, though, does not hinder higher-end platforms from
- exploiting their greater processing power and network bandwidth.
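-
- A client that can only show text might filter a mixed result list as
- in the sketch below. The record layout and the type labels are
- assumptions for illustration and are not taken from the protocol
- specification.
-
```python
# Hypothetical result records; the "type" labels are illustrative only.
results = [
    {"headline": "Annual report", "type": "text"},
    {"headline": "Factory tour", "type": "video"},
    {"headline": "Earnings call", "type": "audio"},
]


def displayable(results: list, supported_types: set) -> list:
    """Keep only the results this client is able to present."""
    return [r for r in results if r["type"] in supported_types]


# A character terminal keeps the text; a richer client keeps everything.
text_only = displayable(results, {"text"})
multimedia = displayable(results, {"text", "audio", "video"})
print([r["headline"] for r in text_only])
print(len(multimedia))
```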
-
- The WAIS protocol is an extension of the existing Z39.50 standard from
- NISO [3]. It has been augmented where necessary to incorporate many of
- the needs of a full-text information retrieval system [4]. To allow
- future flexibility, the standard does not restrict the query language
- or the data format of the information to be retrieved. Nonetheless, a
- query convention has been established for the existing servers and
- clients. The resulting WAIS Protocol is general enough to be
- implemented on a variety of communications systems.
-
- The success of a WAIS-like system depends on a critical mass of users
- and information services. In order to encourage development and use,
- Thinking Machines is not only publishing a specification for the
- protocol, but is also making the source code for a WAIS Protocol
- implementation freely available. While this software is available at
- no cost, it comes with no support. We hope that it will help others
- develop servers and clients.
-
-
- Future
-
- In developing the WAIS system, the participating companies have
- demonstrated that current hardware technology can be effectively used
- to provide sophisticated information retrieval services to novice
- end-users. How this might affect information providers is not yet
- completely understood. The users at Peat Marwick found the technology
- useful for day-to-day tasks such as researching potential new accounts
- and finding resources within their own organization. Since these
- tasks are not restricted to the accounting and management consulting
- industries, we are optimistic that this type of technology can be
- fruitful and productive in many corporate settings.
-
- The future of this system, and others like it, depends upon finding
- appropriate niches in the electronic publishing domain. Potential
- uses include making current online services more easily accessible to
- end-users, or allowing large corporations to access their own internal
- word processor files more efficiently. It is also possible that
- near-term development will focus on a single professional field such
- as patent law or medical research.
-
-
- Summary
-
- A unique alliance of four companies with complementary interests in
- the field of information retrieval has jointly developed a prototype
- which gives versatile access to full-text documents. The system
- allows users to retrieve personal, corporate, and wide area
- information through one easy-to-use interface. The WAIS project has
- shown that current technologies can be used to make useful,
- profitable, and convenient wide area information systems. The success
- of the project has convinced us that a WAIS-like system can be a
- valuable tool for corporate information retrieval.
-
-
- Acknowledgements
-
- The design and development of the WAIS Project has been a collective
- effort, with contributions and ideas coming from many people. Among
- them:
-
- Apple Computer: Charlie Bedard, David Casseras, Steve Cisler, Tom
- Erickson, Ruth Ridder, Eric Roth, John Thompson-Rohrlich, Kevin Tiene,
- Gitta Soloman, Oliver Steele, Janet Vratny-Watts. Dow Jones
- News/Retrieval: Clare Hart, Rod Wang, Roland Laird. Thinking
- Machines: Dan Aronson, Franklin Davis, Jonathan Goldman, Chris Madsen,
- Harry Morris, Patrick Bray, Danny Hillis, Gary Rancourt, Tracy Shen,
- Craig Stanfill, Steve Swartz, Ephraim Vishniac, David Waltz. KPMG
- Peat Marwick: Chris Arbogast, Mark Malone, Tom McDonough, Robin
- Palmer. Scolex Information Systems: Art Medlar. Thanks also to
- Advanced Software Concepts for TCPack software.
-
- For More Information
-
- Brewster Kahle
- Thinking Machines Corporation
- 1010 El Camino Real, Suite 310
- Menlo Park, CA 94025
- 415-329-9300 X228
- brewster@Think.com
-
- Thinking Machines Corporation
- 245 First Street
- Cambridge, MA 02142
- 617-234-1000
-
-
- [1] Salton, Gerald; McGill, Michael. Introduction to Modern Information
- Retrieval. McGraw-Hill, 1983.
-
- [2] DowQuest promotional literature available from Dow Jones & Co. Inc.,
- 200 Liberty Street, New York, NY 10281.
-
- [3] Z39.50-1988: Information Retrieval Service Definition and Protocol
- Specification for Library Applications. National Information
- Standards Organization (Z39), P.O. Box 1056, Bethesda, MD 20817.
- (301) 975-2814. Available from Document Center, Belmont, CA.
- Telephone 415-591-7600.
-
- [4] Franklin Davis et al. WAIS Interface Protocol Prototype Functional
- Specification, Thinking Machines. Available from Franklin Davis
- (fad@think.com) or Brewster Kahle (brewster@think.com).
-
-
-